The goal of this analysis is to discover insights into US aviation accidents since the 1960s.
The NTSB aviation accident database contains information from 1962 and later about civil aviation accidents and selected incidents within the United States, its territories and possessions, and in international waters. Generally, a preliminary report is available online within a few days of an accident. Factual information is added when available, and when the investigation is completed, the preliminary report is replaced with a final description of the accident and its probable cause. Full narrative descriptions may not be available for dates before 1993, cases under revision, or where NTSB did not have primary investigative responsibility.
For each observation (accident), NTSB provides 33 variables detailing the accident. Accident details include a mix of numeric (e.g., number of fatalities), categorical (e.g., engine type), and textual (e.g., written narrative of the incident) data.
I began by examining the distributions and frequencies of numerical and categorical variables. Next, I analyzed the two text fields by quantifying the distributions of words and topics. Finally, I combined the numerical, categorical, and textual data to draw broader conclusions.
To focus the analysis, I simplified the dataset in a few small but important ways. Specifically, I did the following:
Accidents are decreasing over time. The more pronounced reduction in non-fatal accidents compared to fatal accidents could suggest that non-fatal accidents are generally easier to avoid. Accidents are more frequent during the summer months, suggesting that most crashes are related to private flights (hobbyist, etc.), as opposed to commercial flights, which have a less-pronounced annual trend.
The frequency of plane crashes outside the US is likely correlated with the volume of flights (i.e., more flights to Canada means more crashes in Canada). However, there is a striking contrast between the lethality of accidents in the northern and southern hemispheres. This could be due in part to the relaxed nature of air traffic controllers in South America (see Outliers by Malcolm Gladwell) or to the fact that US flights to the southern hemisphere are necessarily longer in duration.
Private airports host the majority of accidents, lending further evidence to the notion that most accidents are related to small-plane hobbyist endeavors.
The vast majority of fatal crashes involve one or two victims. Fatal crashes involving 50 or more people have happened several times since the 1960s, but these cases are extremely rare. It is common to survive an aviation accident with no injuries, suggesting that most accidents in the NTSB database are minor. Historically, when injuries occurred, they were most often fatal; however, in recent years, fatal injuries have become about as rare as non-fatal injuries, suggesting that aircraft have become safer.
Manufacturers of small (1-2 engine) airplanes, including Cessna, Piper, and Beech, top the list of accidents. These lightweight, propeller-powered airplanes are considerably more highly represented in the NTSB dataset than larger, jet-engine models. Two helicopter companies represent the majority of helicopters in deadly accidents: Robinson and Bell.
Regarding specific models, all of the top 10 deadliest models are single-engine, 2-5 passenger planes. The Cessna 172, the most popular plane in history, is also the highest-represented in terms of deadly accidents. The Robinson R44, the world’s best-selling general aviation helicopter, tops the list for deadly helicopters. It is important to note that the airplanes and helicopters mentioned here are not only the most common in terms of accidents, but also the most common in general.
Reciprocating engines, which are internal combustion engines used on propeller planes, are found in the vast majority of aviation accidents - and surely the majority of airplanes. The density plot of engines over time paints an interesting picture of the history of engines. Accidents involving the reciprocating engines peaked at the beginning of the study period and have declined over time. Turbo-charged engines, on the other hand, peaked around the turn of the century, and have decreased since the early 2000s.
About 10% of accidents involving scheduled or commuter flights are fatal, compared to 25% of those involving non-scheduled or air taxi flights. Regarding various flight missions, some endeavors are clearly more dangerous than others. For instance, about half of accidents at air shows and air races are fatal. Accidents involving flights intended to fight fires are also highly dangerous. Instructional flight accidents, on the other hand, have a very low fatal:non-fatal ratio.
To assess airline safety, I quantified the ratio of fatal:non-fatal accidents for airlines that have more than 15 accidents documented in the NTSB database. More than half of accidents involving Petroleum Helicopters (a company that ferries workers to offshore drilling platforms) were fatal. Of the major US commercial airlines, Southwest and Delta had the best safety records by this measure (i.e., the lowest fatal:non-fatal ratios).
Weather appeared to play a major role in fatal aviation accidents. For accidents with low visibility, more than half were fatal. In contrast, only about 10% of accidents with good visibility were fatal.
Over the course of the study period, the majority of fatal accidents occurred during the cruise or maneuvering phases of flight. However, in recent years, the cruise phase has become much safer. The takeoff and landing phases account for almost all non-fatal accidents.
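The per-airline comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function name `fatal_share_by_carrier`, the record layout, and the carrier names are all hypothetical, and it reports the share of accidents that were fatal rather than the fatal:non-fatal ratio itself.

```python
from collections import defaultdict

def fatal_share_by_carrier(accidents, min_accidents=15):
    """Return {carrier: share of its accidents that were fatal}.

    `accidents` is an iterable of (carrier, fatalities) pairs (a made-up
    layout). Only carriers with MORE than `min_accidents` records are kept,
    mirroring the "more than 15 accidents" filter described in the text.
    """
    totals = defaultdict(int)
    fatal = defaultdict(int)
    for carrier, fatalities in accidents:
        totals[carrier] += 1
        if fatalities > 0:
            fatal[carrier] += 1
    return {c: fatal[c] / n for c, n in totals.items() if n > min_accidents}

# Made-up records: 16 accidents for Carrier A (4 fatal), 3 for Carrier B.
records = (
    [("Carrier A", 0)] * 12
    + [("Carrier A", 2)] * 4
    + [("Carrier B", 1)] * 3
)
shares = fatal_share_by_carrier(records)
print(shares)  # Carrier B is dropped because it has too few accidents
```

The minimum-count filter matters because a carrier with only a handful of records can show an extreme fatal share purely by chance.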
Many of the accident reports include a “probable cause” statement. I created a corpus from these statements, then quantified the term frequency-inverse document frequency (TF-IDF) to reveal the most relevant words in the corpus. Finally, I uncovered insights into relevant words for different types of accidents.
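As a rough illustration of the TF-IDF step, the sketch below computes one common variant of the weighting (term frequency as a proportion of the document, inverse document frequency as a natural log). The example statements are invented, and the exact variant and tokenization used in the analysis are not stated, so treat both as assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF weights for a list of non-empty tokenized documents.

    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented "probable cause" statements, split on whitespace.
statements = [
    "pilot failed to maintain directional control during landing",
    "loss of engine power due to fuel exhaustion",
    "pilot failed to maintain airspeed during landing",
]
docs = [s.split() for s in statements]
scores = tf_idf(docs)
for doc_scores in scores:
    # Show the three highest-weighted terms of each statement.
    print(sorted(doc_scores, key=doc_scores.get, reverse=True)[:3])
```

A word that appears in every statement (here, "to") gets a weight of zero, while a word confined to one statement (such as "fuel") is weighted most heavily.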
To find out which words are most closely associated with fatal accidents, I constructed a regularized logistic regression model to predict fatal accidents from TF-IDF word frequencies. Regularization is useful here because it penalizes large weights and so limits the influence of words that are not very important. Strictly speaking, eliminating such words outright requires an L1 (lasso) penalty, which sets their weights to exactly zero; a ridge (L2) penalty only shrinks weights toward zero. Then, I used the weights from the regression model to identify words that are most closely correlated with fatal accidents.
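The difference between the two kinds of penalty can be seen in a small, self-contained sketch. It is not the model fitted in the analysis: the data are made up, the function `fit_logistic` is hypothetical, and plain (proximal) gradient descent stands in for whatever solver was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, penalty, lam=0.1, lr=0.5, steps=2000):
    """Fit logistic regression by (proximal) gradient descent.

    penalty="l2": ridge, adds lam * w to the gradient (weights shrink).
    penalty="l1": lasso, soft-thresholds the weights after each step
                  (small weights become exactly zero).
    The intercept b is never penalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        if penalty == "l2":
            w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        else:
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            # Soft-thresholding: move each weight toward zero by lr * lam,
            # stopping at exactly zero.
            w = [math.copysign(max(abs(wj) - lr * lam, 0.0), wj) for wj in w]
        b -= lr * grad_b
    return w, b

# Made-up data: column 0 is a strong predictor of y, column 1 a weak one.
X = [[1, 1], [1, 1], [1, 0], [0, 0], [0, 1], [0, 0], [0, 0], [1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
w_l1, _ = fit_logistic(X, y, penalty="l1")
w_l2, _ = fit_logistic(X, y, penalty="l2")
print("L1 weights:", w_l1)
print("L2 weights:", w_l2)
```

On this toy data the L1 fit drops the weak predictor entirely, while the L2 fit keeps it with a small non-zero weight. That behavior is why a count of words used by the model is only meaningful under an L1-type penalty.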
We can see that the best model (highest AUC) utilizes about 100 words to predict whether or not an accident is fatal.
Overall, the TF-IDF analysis suggests that words like “landing”, “loss”, and “fuel” are relevant terms in the corpus, meaning that they are important for distinguishing one probable cause statement from another.
By fitting a logistic regression model to the terms, I found that fatal accidents are highly correlated with instrument failure (“instrument”, “control”) and weather conditions (“weather”, “night”). On the other hand, non-fatal accidents are highly correlated with takeoff/landing (“landing”, “soft”, “touchdown”, “runway”) and windy conditions (“crosswind”, “directional”).
To further analyze the “probable cause” statements, I uncovered common themes within the text using the latent Dirichlet allocation (LDA) topic model. The LDA model posits that each document is a mixture of overlapping topics, which themselves are mixtures of overlapping words.
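The generative story behind LDA can be sketched in a few lines. This is only an illustration of the model's assumptions, not the fitting procedure: the two toy topics, their word probabilities, and the helper names are invented for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one probability vector from a Dirichlet(alpha) distribution."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab = list(topics[k])
        probs = [topics[k][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each is a probability distribution over words.
# "pilot" appears in both, so the topics overlap.
toy_topics = [
    {"landing": 0.5, "runway": 0.3, "pilot": 0.2},
    {"engine": 0.5, "fuel": 0.3, "pilot": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it infers plausible topic mixtures for each document and word distributions for each topic.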
I identified and characterized the major topics of probable cause statements. Then, I examined the relationships between these topics and other characteristics of aviation accidents.
## K = 5; V = 143; M = 35503
## Sampling 500 iterations!
## Gibbs sampling completed!
## [1] 72.16715
## [1] 5
## K = 10; V = 143; M = 35503
## Sampling 500 iterations!
## Gibbs sampling completed!
## [1] 66.96348
## [1] 10
## K = 15; V = 143; M = 35503
## Sampling 500 iterations!
## Gibbs sampling completed!
## [1] 62.99547
## [1] 15
## K = 20; V = 143; M = 35503
## Sampling 500 iterations!
## Gibbs sampling completed!
## [1] 60.12917
## [1] 20
## K = 25; V = 143; M = 35503
## Sampling 500 iterations!
## Gibbs sampling completed!
## [1] 58.13223
## [1] 25
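The value printed after each run falls steadily as the number of topics K increases (from about 72.2 at K = 5 to about 58.1 at K = 25), with no obvious point at which the improvement levels off, so I simply used K = 10.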
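One common yardstick for comparing topic counts is perplexity, where lower values mean the model predicts the words better. Whether the values printed above are perplexities is an assumption on my part; the sketch below only shows how such a score can be computed from a fitted model's document-topic and topic-word probabilities, using invented numbers.

```python
import math

def perplexity(docs, doc_topic, topic_word):
    """Perplexity = exp(-(total log-likelihood) / (number of tokens)).

    docs:       list of tokenized documents
    doc_topic:  one topic-probability list per document
    topic_word: one {word: probability} dict per topic
    Each word's probability is the topic-weighted mixture
    sum_k doc_topic[d][k] * topic_word[k][word].
    """
    log_lik, n_tokens = 0.0, 0
    for tokens, theta in zip(docs, doc_topic):
        for w in tokens:
            p = sum(t * phi.get(w, 0.0) for t, phi in zip(theta, topic_word))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

docs = [["landing", "runway"], ["fuel", "engine"]]

# An uninformative model: every topic is uniform over the four words.
flat_topics = [{w: 0.25 for w in ("landing", "runway", "fuel", "engine")}] * 2
flat = perplexity(docs, [[0.5, 0.5], [0.5, 0.5]], flat_topics)

# A model whose topics separate landing words from engine words.
sharp_topics = [
    {"landing": 0.4, "runway": 0.4, "fuel": 0.1, "engine": 0.1},
    {"landing": 0.1, "runway": 0.1, "fuel": 0.4, "engine": 0.4},
]
sharp = perplexity(docs, [[0.9, 0.1], [0.1, 0.9]], sharp_topics)
print(flat, sharp)
```

Because scores of this kind usually keep improving as K grows, the final choice of K often comes down to picking a value at which the topics remain interpretable.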
## K = 10; V = 144; M = 47336
## Sampling 500 iterations!
## Gibbs sampling completed!
## # A tibble: 6 x 3
## # Groups:   State [6]
##   State AirportName                 `Dangerous landing value`
##   <chr> <chr>                                           <dbl>
## 1 IA    WATERLOO REGIONAL AIRORT                        0.229
## 2 VA    MERCER COUNTY                                   0.215
## 3 TN    HUNTER FIELD                                    0.213
## 4 TX    SINTON/SAN PATRICIO COUNTY                      0.213
## 5 LA    SHREVEPORT REGIONAL AIRPORT                     0.210
## 6 IL    VERMILLION COUNTY AIRPORT                       0.209